Terminology note: an early version of this document used the term branch to refer to what is now called a path, where a path is a linked sequence of coordinates. Technically a path can be composed of a single vertex, which is partly why a different term was used originally. There are some abbreviations “b” and “bXv” below that still refer to this old usage, they stand for “path” and “paths-link-vertex” respectively.
In this document I describe a “normal-form” that provides a very general way of extending the traditional GIS forms, and is a bridge between vector and raster rather than being a different form altogether. The purpose of this document is to advocate for this general form of data organization that can be used for new extended uses of GIS. I’m not arguing that this be used in place of other optimized forms, although it can be: I am interested in operations that simply cannot be performed already.
When we talk about vector data in geo-spatial, we have at least three levels of hierarchy to consider in the data structures.
GIS tools typically only provides direct access to the objects, though the relations between paths and coordinates can sometimes be seen.
We generally cannot store information against the paths or the coordinates, beyond what they inherently are defined by. For coordinates this is the X and Y values, and/or the longitude and latitudes, but simple features does provide the ability to store a “third” coordinate “Z” and or a measure coordinate “M”. M is typically used for “linear referencing”, and not a more general multidimensional geometry (like time).
I’ll use the countries example from a GeoPackage file provided here. I use R because I can tell the complete story with it, in a concrete and reproducible way.
Read in a polygon vector layer in traditional GIS form, and plot it.
#library(rworldmap)
#data(countriesLow)
#p <- countriesLow
library(rgdal)
p <- readOGR(system.file("extdata", "small_world.gpkg", package = "anglr"), "ne_110m_admin_0_countries")## OGR data source with driver: GPKG
## Source: "/perm_storage/home/mdsumner/R/x86_64-pc-linux-gnu-library/3.4/anglr/extdata/small_world.gpkg", layer: "ne_110m_admin_0_countries"
## with 177 features
## It has 1 fields

This object p presents a “data frame” (i.e. a table) front-end that we can query and use at the objects level, much like in some GIS software we can easily select a single object or row in a table.
library(spbabel)
pnganz <- subset(p, name %in% c("Australia", "Indonesia", "New Zealand", "Papua
New Guinea"))
pnganz## class : SpatialPolygonsDataFrame
## features : 3
## extent : 95.29303, 178.5171, -46.64124, 5.479821 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## variables : 1
## # A tibble: 3 x 1
## name
## * <fct>
## 1 Australia
## 2 Indonesia
## 3 New Zealand

Looking at the object’s underlying geometric structure shows nested lists of matrixes of x,y coordinates. There is one matrix per branch, analogous to the way that feature parts are nested in standard Geom binary forms like WKB. Each component path stores extra information about whether it is a hole, the ring direction, a coordinate for label plotting and so on. We otherwise cannot store any more information on the component parts though.
NOTE: the Spatial classes here are pre-simple features, so they are more analogous to the structures in a shapefile in that a polygon hole’s “island parent”" may be ambiguous, but this is not so important to this story. R now has simple features in the sfr project here, which adds Z, M and the possibility of some of the exotic types as well. https://github.com/edzer/sfr
These hierarchical structures can be serialized and stored in different ways, typically they are stored as binary geoms and stored directly in a table.
An interesting aspect here is that these structures don’t describe the topology of the objects in any special way, these are just paths of coordinates, and when they are plotted or used in analysis there’s a rule about how space is enclosed by a closed path. If we treat them as lines (as SpatialLinesDataFrame), the only difference is to not treat them as enclosed paths. Literally the only difference in the structure of this object from the polygons version is the name of the class, and the behaviour that methods invoked for this object will provide.
plot(as(pnganz, "SpatialLinesDataFrame"))
plot(as(pnganz, "SpatialLinesDataFrame"), col = viridis::viridis(nrow(pnganz), alpha = 0.7), lwd = c(2, 4), add = TRUE)
str(as(geometry(pnganz[1,]), "SpatialLines"))## Formal class 'SpatialLines' [package "sp"] with 3 slots
## ..@ lines :List of 1
## .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots
## .. .. .. ..@ Lines:List of 2
## .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
## .. .. .. .. .. .. ..@ coords: num [1:17, 1:2] 145 146 147 148 148 ...
## .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
## .. .. .. .. .. .. ..@ coords: num [1:224, 1:2] 144 144 145 145 145 ...
## .. .. .. ..@ ID : chr "9"
## ..@ bbox : num [1:2, 1:2] 113.3 -43.6 153.6 -10.7
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:2] "x" "y"
## .. .. ..$ : chr [1:2] "min" "max"
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
If we convert these data to “normal form”, we actually need at least three tables, one each for the objects, the paths, and the coordinates (vertices). The map_table function in the spbabel package creates these but also adds another link table between paths and vertices to enable de-duplication of shared vertices. The de-duplication is required for triangulating the polygons, and other topological operations.
## please note, map_table was a very early silicate::PATH
ptabs <- spbabel::map_table(pnganz)
print(names(ptabs))## [1] "o" "b" "bXv" "v"
print(sapply(ptabs, nrow))## o b bXv v
## 3 17 557 540
Now it’s a bit clearer how the underlying entities are put together. Each table here has a unique relational id, this allows us to subset and recombine these tables without having to convert indexes each time.
The objects.
ptabs$o## # A tibble: 3 x 2
## name object_
## * <fct> <chr>
## 1 Australia eUUAPKPXPv
## 2 Indonesia HHVqqD0Iey
## 3 New Zealand 8WJJAfmCBf
The paths record which object they belong to.
ptabs$b## # A tibble: 17 x 3
## object_ branch_ island_
## <chr> <chr> <lgl>
## 1 eUUAPKPXPv SCxRwDeFVZ TRUE
## 2 eUUAPKPXPv VDEfUB0CNe TRUE
## 3 HHVqqD0Iey 47oOZqTQpy TRUE
## 4 HHVqqD0Iey AJwh6kOAv8 TRUE
## 5 HHVqqD0Iey MDE37idcvn TRUE
## 6 HHVqqD0Iey hqSlQ7pzNI TRUE
## 7 HHVqqD0Iey dsPcgdfVxO TRUE
## 8 HHVqqD0Iey cJLGj320At TRUE
## 9 HHVqqD0Iey A9oTyYEpIG TRUE
## 10 HHVqqD0Iey ZvPYofdOor TRUE
## 11 HHVqqD0Iey MGesuvEjiF TRUE
## 12 HHVqqD0Iey 5pyibtLU4E TRUE
## 13 HHVqqD0Iey E4dDv3jSPc TRUE
## 14 HHVqqD0Iey feU88P0LaL TRUE
## 15 HHVqqD0Iey OLVIOyjjmu TRUE
## 16 8WJJAfmCBf K1EuCY1Xwx TRUE
## 17 8WJJAfmCBf kqtMrcC9IL TRUE
The paths-link-vertex table records the relationship between vertices and paths (by default the de-duplication is done in X-Y but it could be done in other geometric spaces, e.g. in 1D time or 3D X-Y-Z or X-Y-Time).
This is the instances of vertices as opposed to the unique paired values of coordinates themselves.
ptabs$bXv## # A tibble: 557 x 3
## branch_ order_ vertex_
## <chr> <int> <chr>
## 1 SCxRwDeFVZ 1 dmJZ8jgGh0
## 2 SCxRwDeFVZ 2 7RZY8H2aLj
## 3 SCxRwDeFVZ 3 okbLVF4lgt
## 4 SCxRwDeFVZ 4 kDYO5w5sfW
## 5 SCxRwDeFVZ 5 iARu6Ba8NJ
## 6 SCxRwDeFVZ 6 wUOj4qW3qN
## 7 SCxRwDeFVZ 7 cUnEQxVz2t
## 8 SCxRwDeFVZ 8 QP2C0hpkhd
## 9 SCxRwDeFVZ 9 3pEoLY2X8z
## 10 SCxRwDeFVZ 10 9rOYEC2Exx
## # ... with 547 more rows
And finally the vertices. In this example there are fewer unique x-y vertex than there are instance of the vertices, not a one-to-one match. This discrepancy obviously increases greatly for layers with shared boundaries, though in this example it is mostly due to the final closing coordinate on each polygon path - it’s a repeated instance, but not a repeated vertex value. There is at least one shared edge in this layer, clearly the one between Indonesia and Papua New Guinea.
ptabs$v## # A tibble: 540 x 3
## x_ y_ vertex_
## <dbl> <dbl> <chr>
## 1 145. -40.8 dmJZ8jgGh0
## 2 146. -41.1 7RZY8H2aLj
## 3 147. -41.0 okbLVF4lgt
## 4 148. -40.8 kDYO5w5sfW
## 5 148. -40.9 iARu6Ba8NJ
## 6 148. -42.1 wUOj4qW3qN
## 7 148. -42.4 cUnEQxVz2t
## 8 148. -43.2 QP2C0hpkhd
## 9 148. -42.9 3pEoLY2X8z
## 10 147. -43.6 9rOYEC2Exx
## # ... with 530 more rows
From this form we can see clearly that polygons and lines in GIS are really the same thing, we have paths of coordinates and then rules about how they are used.
If we compare each entity table side by side it’s clear the only difference is whether a path is badged as an island vs. a hole.
For points we don’t need the paths or the order data, though for multipoints we do need branch.
ltabs <- spbabel::map_table(as(pnganz, "SpatialLinesDataFrame"))
for (i in seq_along(ltabs)) {
writeLines("------------------------------")
print(ptabs[i])
writeLines("")
}## ------------------------------
## $o
## # A tibble: 3 x 2
## name object_
## * <fct> <chr>
## 1 Australia eUUAPKPXPv
## 2 Indonesia HHVqqD0Iey
## 3 New Zealand 8WJJAfmCBf
##
##
## ------------------------------
## $b
## # A tibble: 17 x 3
## object_ branch_ island_
## <chr> <chr> <lgl>
## 1 eUUAPKPXPv SCxRwDeFVZ TRUE
## 2 eUUAPKPXPv VDEfUB0CNe TRUE
## 3 HHVqqD0Iey 47oOZqTQpy TRUE
## 4 HHVqqD0Iey AJwh6kOAv8 TRUE
## 5 HHVqqD0Iey MDE37idcvn TRUE
## 6 HHVqqD0Iey hqSlQ7pzNI TRUE
## 7 HHVqqD0Iey dsPcgdfVxO TRUE
## 8 HHVqqD0Iey cJLGj320At TRUE
## 9 HHVqqD0Iey A9oTyYEpIG TRUE
## 10 HHVqqD0Iey ZvPYofdOor TRUE
## 11 HHVqqD0Iey MGesuvEjiF TRUE
## 12 HHVqqD0Iey 5pyibtLU4E TRUE
## 13 HHVqqD0Iey E4dDv3jSPc TRUE
## 14 HHVqqD0Iey feU88P0LaL TRUE
## 15 HHVqqD0Iey OLVIOyjjmu TRUE
## 16 8WJJAfmCBf K1EuCY1Xwx TRUE
## 17 8WJJAfmCBf kqtMrcC9IL TRUE
##
##
## ------------------------------
## $bXv
## # A tibble: 557 x 3
## branch_ order_ vertex_
## <chr> <int> <chr>
## 1 SCxRwDeFVZ 1 dmJZ8jgGh0
## 2 SCxRwDeFVZ 2 7RZY8H2aLj
## 3 SCxRwDeFVZ 3 okbLVF4lgt
## 4 SCxRwDeFVZ 4 kDYO5w5sfW
## 5 SCxRwDeFVZ 5 iARu6Ba8NJ
## 6 SCxRwDeFVZ 6 wUOj4qW3qN
## 7 SCxRwDeFVZ 7 cUnEQxVz2t
## 8 SCxRwDeFVZ 8 QP2C0hpkhd
## 9 SCxRwDeFVZ 9 3pEoLY2X8z
## 10 SCxRwDeFVZ 10 9rOYEC2Exx
## # ... with 547 more rows
##
##
## ------------------------------
## $v
## # A tibble: 540 x 3
## x_ y_ vertex_
## <dbl> <dbl> <chr>
## 1 145. -40.8 dmJZ8jgGh0
## 2 146. -41.1 7RZY8H2aLj
## 3 147. -41.0 okbLVF4lgt
## 4 148. -40.8 kDYO5w5sfW
## 5 148. -40.9 iARu6Ba8NJ
## 6 148. -42.1 wUOj4qW3qN
## 7 148. -42.4 cUnEQxVz2t
## 8 148. -43.2 QP2C0hpkhd
## 9 148. -42.9 3pEoLY2X8z
## 10 147. -43.6 9rOYEC2Exx
## # ... with 530 more rows
The coordinate-path structures used above for polygons and lines are very explicit, and in traditional form cannot be used in a more abstract way. By collecting the attributes of the entities in use into their own tables we start to build this abstraction. The paths are represented as a sequence of identifiers, rather than the actual coordinate values themselves. Why do this? We can abstract the choice of what do with those coordinate away from their storage. We also get a limited form of topology, in that a change made to one vertex coordinate attribute is reflected in all of the paths that use that vertex, analogous the Shared Edit mode in Manifold 8.0.
The next step in topological relationships is to represent each segment of a line rather than the entire path. To do this we need a table of segments, and a link table to store the identity of the two vertices used by those segments.
This has been implemented in the package anglr, but (more recently) replaced by models in silicate.
lsegment <- silicate::SC(as(pnganz, "SpatialLinesDataFrame"))
as.data.frame(lapply(lsegment, nrow))## object object_link_edge edge vertex meta
## 1 3 540 540 540 1
rgl::rgl.clear()
library(geosphere)
library(anglr)
xyz <- proj4::ptransform(randomCoordinates(5e4) * pi/180,
"+init=epsg:4326",
"+proj=geocent +a=1")
tri <- geometry::convhulln(xyz)
rgl::triangles3d(xyz[t(tri), ],
specular = "black",
color = "skyblue", alpha = 0.4)
plot3d(globe(lsegment, "+proj=geocent +a=1.05"), add = TRUE)
rgl::rglwidget()